Abstract: 


Given the dynamic and convoluted nature of many datasets, improving performance while 
handling multiple datasets has become more challenging. To address these issues, this study 
applies several Machine Learning (ML) techniques: K-Nearest Neighbor (KNN), Logistic 
Regression (LR), Naive Bayes (NB), and Support Vector Machine (SVM). The paper performs 
binary classification on five different datasets with many predictor variables. The research 
focuses mainly on classifying data into subsets that share common patterns. Many approaches 
have been studied extensively in the existing literature to achieve better yields; however, 
they were inadequate to provide efficient outcomes. By applying four supervised ML 
classification algorithms to datasets from the UCI ML Repository, the robustness of the 
method is improved. The proposed mechanism is assessed using five performance criteria: 
accuracy, AUC (Area Under Curve), precision, recall, and F-measure. The experimental results 
revealed a significant improvement in the confusion matrix rate compared with a similar 
study, and the method can also be used for ML problems such as binary classification. 


Keywords: Machine Learning, Data Mining, K-Nearest Neighbor, Logistic Regression, Naive 
Bayes, Support Vector Machine 


1. Introduction 


The combination of classifiers is now an active research area in ML and Pattern 
Recognition [1][2][3]. Many theoretical and empirical studies have been published that 
show the advantages of the combination paradigm over individual classifier models [4][5]. 
A significant amount of research has been conducted to design multiple classifier systems 
based on the same classifier models trained on different data or feature subsets. ML has 
been widely used in a variety of fields, such as Remote Sensing, Image Classification, and 
Pattern Recognition. 

ML can learn and improve automatically from experience, without explicit programming. Its 
primary aim is to automate learning without human intervention. ML algorithms use 
statistics to find patterns in massive amounts of data [6]. The algorithms used in this 
research are briefly described below: 

Firstly, KNN is a simple algorithm that stores all available cases and classifies new 
cases based on a similarity measure. It is used for statistical estimation and pattern 
recognition [7][8]. 

Secondly, LR is a standard statistical approach that is well suited to regression analysis 
where the dependent variable is binary. It is used to describe the data and to explain the 
relationship between one dependent binary variable and one or more independent nominal, 
ordinal, interval, or ratio-level variables [9]. 

Thirdly, the NB classifier combines Bayes' theorem with a decision rule for choosing among 
hypotheses, which provides satisfactory results. It applies Bayes' theorem with the naive 
assumption of conditional independence between every pair of features given the value of 
the class variable. The authors of [10] proposed an NB learning framework for large-scale 
computational efficiency and multi-domain platform classification. 

Fourthly, SVM is a paradigm that uses classification algorithms for two-group problems. 
Its accuracy and predictive performance on survival after traumatic brain injury were 
significantly better than those of LR [11]. 

The rest of this paper is structured as follows. Section 2 briefly describes previous 
related work. Section 3 explains the methodology adopted for performing the experiments. 
Section 4 presents the experimental work, the dataset details, and the evaluation of the 
experiments. Lastly, Section 5 draws conclusions based on the outcomes and suggests future 
work. 


2. Related Work 


Classification based on KNN, LR, NB, and SVM has recently witnessed a surge of research 
efforts. In this paper, we use supervised classification. According to the literature, 
classification algorithms can be affected significantly, even negatively, by some features 
[1][2]. The goal of classification is to accurately predict the target class for each case 
in the data. During the model-building (training) process, a classification algorithm 
finds relationships between the values of the predictors and the values of the target. 
Different classification algorithms use different procedures to discover these 
relationships. The relationships are summarized in a model, which can then be applied to a 
different dataset in which the class is unknown [12][13][14]. 
According to [15], KNN is the slowest classification technique because the classification 
time is directly proportional to the number of data instances. As the data size grows, 
more distance calculations must be performed, which makes it extremely slow. Moreover, it 
uses the number of nearest neighbors “k” as one of the parameters in classifying an 
object, and the value of k influences the classifier performance [16]. 

In [17], Cubic SVM, Quadratic SVM, and Linear SVM performed better than LR in predicting 
the outcome of traumatic brain injury. 

According to [18], NB is one of the most popular data mining algorithms. Empirical results 
indicate that selective NB demonstrates superior classification performance while 
retaining simplicity and flexibility. SVM is a useful method for solving classification 
and regression problems. In [19], the SVM approach can substantially improve prediction 
accuracy and would help to mitigate the adverse impact on urban expansion. 


3. Methodology 


This section presents an overview of the proposed method and describes the data 
preprocessing stage and the classification algorithms. 


3.1. Overview of the Proposed System 


An overview of the proposed system is given in Fig. 1. This system consists of several 
phases: datasets, base learners, and comparative analysis of the results. In addition, to 
assess the generalization performance of the system, 10-fold cross-validation is used for 
all classifier learners and datasets. 

Fig.1. The framework of the method. 


3.2. Data Preprocessing 


In this phase, the ranges of the values of the data from different ML datasets may be 
large. In such a scenario, certain features can significantly, even negatively, affect the 
classification accuracy of the algorithms. Therefore, the data values are normalized to 
the [0,1] range using the min-max normalization technique [20]. 
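
As an illustration of the normalization step described above, the following minimal Python sketch rescales one feature column to the [0,1] range. The function name `min_max_normalize`, the example values, and the handling of a constant column are assumptions made for this sketch; the study itself relies on WEKA rather than on this code.

```python
def min_max_normalize(column):
    """Rescale a list of numbers to the [0, 1] range with min-max normalization."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant feature has no spread; mapping it to 0.0 is a convention
        # assumed for this sketch.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical feature column with a wide range of values.
ages = [20, 35, 50, 80]
scaled = min_max_normalize(ages)
print(scaled)
```

After rescaling, every feature contributes on the same scale, which matters in particular for distance-based learners such as KNN.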


3.3. Classification of Algorithms 


In this study, four base learners, namely KNN, LR, NB, and SVM, are employed. 

The methods involve several phases related to the datasets and the ML classifiers. In this 
work, four ML classifiers are evaluated on several datasets for binary classification. 

The LR classifier relies on feature extraction. Typically, it delivers more reliable 
results than KNN, NB, and SVM. The primary aim of this analysis is to establish the 
classification accuracy and evaluate the performance on multiple datasets. 

The KNN classifier does not have a specialized training phase: it uses all the training 
data during classification, and it does not assume anything about the underlying data 
[15]. 
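
The behaviour described above can be sketched in a few lines of Python. This is a simplified illustration, not the WEKA implementation used in the experiments; the function name `knn_predict`, the Euclidean distance, the value k = 3, and the toy data are all assumptions of the sketch.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest stored instances."""
    # There is no training step: every stored instance is compared with the query.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, already normalized two-feature data with two classes.
points = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9), (0.85, 0.75)]
labels = ["neg", "neg", "pos", "pos", "pos"]

print(knn_predict(points, labels, (0.15, 0.15)))
print(knn_predict(points, labels, (0.8, 0.8)))
```

Because every query is compared with all stored instances, the cost of one prediction grows with the size of the training data.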

The LR classifier is another method borrowed by ML from the field of statistics. It is a 
statistical model used when the dependent variable is categorical. 

NB is a probabilistic ML model. It is highly scalable, requiring a number of parameters 
that is linear in the number of variables [18]. 

SVM is an ML algorithm that can be used for classification problems as well as for 
regression. It treats each individual observation as a point given by its coordinates and 
segregates the points into two classes. 


4. Experimental Design 


In the following subsections, we describe the experimental process, the evaluation 
measures, and the experimental results. 


4.1. Experimental Process 


In the experimental process, five datasets from the UCI ML Repository [21] have been used. 
All experiments are performed with a total of four ML classifiers using the WEKA (Waikato 
Environment for Knowledge Analysis) ML toolkit and the Java programming language [22]. We 
have used the default parameter values for all the classifiers in WEKA. 


SJCMS | E-ISSN: 2520-0755 | Vol. 4 | No. 1 | © 2020 Sukkur IBA University 

Abro (et al.), Identifying the Machine Learning Techniques for Classification of Target Datasets (pp. 45 - 52) 

In addition, we have applied 10-fold cross-validation to all datasets to yield reliable 
results. In 10-fold cross-validation, the original dataset is randomly partitioned into 10 
equally sized sets, one of which is used for testing while the remaining sets are used for 
training. The process is repeated 10 times, each time with a different test set, and the 
results are averaged. 

Dataset characteristics are described in terms of the attributes and the number of 
instances. These datasets are typically used to solve ML-related problems. The number of 
instances, attributes, and classes for each dataset, along with the numerical attribute 
descriptions, are presented in Table 1. The datasets were selected from the UCI ML 
Repository according to their distinct parameters, by investigating which datasets are 
suitable for binary classification problems. 
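
The 10-fold cross-validation procedure described above can be sketched as follows. This is a plain, unstratified partition written for illustration; the names `k_fold_indices` and `cross_validate`, the fixed seed, and the `evaluate` callback are assumptions of the sketch and are not taken from WEKA. When the number of instances is not divisible by 10, the fold sizes differ by at most one.

```python
import random


def k_fold_indices(n_instances, k=10, seed=0):
    """Randomly partition the indices 0..n_instances-1 into k near-equal folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(n_instances, evaluate, k=10, seed=0):
    """Average the score of evaluate(train_idx, test_idx) over the k folds."""
    folds = k_fold_indices(n_instances, k, seed)
    scores = []
    for test_no, test_idx in enumerate(folds):
        # Every fold except the current one is used for training.
        train_idx = [i for no, fold in enumerate(folds) if no != test_no for i in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / len(scores)


# 155 instances, as in the Hepatitis row of Table 1.
folds = k_fold_indices(155)
print([len(fold) for fold in folds])

# Placeholder evaluation: the share of instances held out in each round.
held_out_share = cross_validate(155, lambda train, test: len(test) / (len(train) + len(test)))
print(held_out_share)
```

In a real experiment, `evaluate` would train a classifier on the training indices and return its score on the test indices.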


Table 1. Characteristics of the Five Datasets Used in This Study 

Datasets      | Instances | Attributes | Classes 
Annealing     | 898       | 39         | 6 
Breast Cancer | 286       | 10         | 2 
Hepatitis     | 155       | 20         | 
Vertebral     | 240       | 7          | 
Yeast         | 1484      | 9          | 10 


In this work, four different ML approaches have been applied to the five datasets, which 
are considered suitable for classification. The performance metrics are calculated as for 
binary classification problems, based on the confusion matrix. 


4.2. Assessment of Measures 


This section describes the five performance 
evaluation measures of the proposed method, 
consisting of accuracy, AUC, precision, recall 
and F-measure. 


Accuracy reflects how close a measurement is to an accepted value; here it is the fraction 
of instances that are classified correctly. It is specified in Eq. 1. 

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)$$

In Eq. 1, TN, FN, FP, and TP denote the numbers of True Negatives, False Negatives, False 
Positives, and True Positives. 

AUC represents the area under the ROC curve. It measures the entire two-dimensional area 
beneath the ROC curve from (0,0) to (1,1). 
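
To make the AUC definition above concrete, the sketch below traces the ROC curve from (0,0) to (1,1) and integrates it with the trapezoidal rule. The function name `roc_auc`, the label encoding (1 for positive, 0 for negative), and the example scores are assumptions of this sketch; it also assumes that both classes are present.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve; a higher score means 'more likely positive'."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    tp = fp = 0
    points = [(0.0, 0.0)]  # (false positive rate, true positive rate)
    i = 0
    while i < len(ranked):
        # Lower the threshold past all instances that share the same score.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j

    # Trapezoidal rule over consecutive ROC points.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical classifier scores for three positives and three negatives.
example_auc = roc_auc([0.9, 0.8, 0.7, 0.6, 0.55, 0.4], [1, 1, 0, 1, 0, 0])
print(round(example_auc, 3))
```

In this example, 8 of the 9 positive-negative pairs are ranked correctly, so the area is 8/9.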

Precision is the positive predictive value [1][23]. Precision defines how reliable 
measurements are, even when they are far from the accepted value. 

The equation of precision is shown in Eq. 2. 

$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (2)$$

Recall is the hit rate [1][23]. It is the counterpart of precision: it weighs false 
negatives, rather than false positives, against true positives. The equation is given in 
Eq. 3. 

$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (3)$$

F-measure can be defined as the weighted average [1][24] of precision and recall (their 
harmonic mean). This rating considers both false positives and false negatives. The 
equation is given in Eq. 4. 

$$F\text{-measure} = \frac{2}{1/\mathrm{precision} + 1/\mathrm{recall}} \quad (4)$$

These criteria are weighted in proportion to the prevalence of each reference class in the 
data. 
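
The measures in Eqs. 1-4 and the prevalence weighting can be illustrated with a short Python sketch. The confusion-matrix counts below are hypothetical, and the names `binary_metrics` and `weighted_average`, as well as the choice to return 0.0 when a denominator is zero, are assumptions made for this sketch.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F-measure (Eqs. 1-4) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision and recall:
        f_measure = 2 / (1 / precision + 1 / recall)
    else:
        f_measure = 0.0  # convention assumed here when precision or recall is zero
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}


def weighted_average(per_class_values, class_counts):
    """Average per-class values, weighting each class by its prevalence."""
    total = sum(class_counts)
    return sum(value * count / total for value, count in zip(per_class_values, class_counts))


# Hypothetical confusion matrix: 50 actual positives and 50 actual negatives.
tp, fn = 40, 10  # actual positives predicted as positive / negative
fp, tn = 5, 45   # actual negatives predicted as positive / negative

pos = binary_metrics(tp, tn, fp, fn)  # positive class as the reference class
neg = binary_metrics(tn, tp, fn, fp)  # negative class as the reference class

class_counts = [tp + fn, fp + tn]
weighted_recall = weighted_average([pos["recall"], neg["recall"]], class_counts)
weighted_f = weighted_average([pos["f_measure"], neg["f_measure"]], class_counts)
print(pos, neg, weighted_recall, weighted_f)
```

With prevalence weights, the weighted recall equals the overall accuracy. This is consistent with the Recall and Acc columns agreeing in the complete rows of Tables 2-6, for example 0.826 and 82.5806% for LR in Table 4.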


4.3. Experimental Results 


Tables 2-6 present the weighted accuracy, AUC, precision, recall, and F-measure values of 
the ML algorithms for all datasets. In Tables 2-6, the highest Acc, AUC, Precision, 
Recall, and F-measure values are shown in bold, while grey shading marks the weakest 
results. 

To sum up, Tables 2-6 are organized by dataset and report the results of the ML approaches 
under the different measures. In Table 2, LR has the best outcome, with 99.1091% Acc 
compared with the others. In Table 3, KNN gives an adequate result of 72.3776% Acc. 
Similarly, in Table 4, NB presents an effective result of 84.5161% Acc, whereas in Table 
5, SVM achieves a productive outcome of 92.9167% Acc. Finally, LR shows a 58.6253% Acc 
result in Table 6. 


Table 4: Weighted Values for Hepatitis Dataset 

Methods | Acc (%) | AUC   | Precision | Recall | F-Measure 
KNN       80.6452   0.806 
LR      | 82.5806 | 0.802 | 0.814     | 0.826  | 0.818 
NB      | 84.5161 | 0.860 | 0.853     | 0.845  | 0.848 
SVM       0.731   0.814   0.802 


The Annealing, Hepatitis, and Vertebral datasets show strong results for accuracy, AUC, 
precision, recall, and F-measure in Tables 2, 4, and 5; however, Breast Cancer gives only 
moderately satisfactory results in Table 3, and Yeast shows lower outcomes in Table 6. 

Furthermore, the analysis shows that LR provides the most accurate outcome for the 
Annealing dataset in Table 2. Likewise, KNN gives adequate results on the Breast Cancer 
dataset in Table 3, and NB presents effective results on the Hepatitis dataset in Table 4. 
In addition, SVM provides positive findings on the Vertebral dataset in Table 5. Finally, 
LR gives the best result on the Yeast dataset in Table 6. 


Table 2: Weighted Values for Annealing Dataset 

Methods | Acc (%) | AUC   | Precision | Recall | F-Measure 
KNN     | 99.1090 | 0.985 | 0.991     | 0.991  | 0.991 
LR      | 99.1091 | 0.992 | 0.991     | 0.991  | 0.991 
NB      | 86.3029 | 0.957 | 0.933     | 0.863  | 0.882 
SVM 


Table 3: Weighted Values for Breast Cancer Dataset 

Methods | Acc (%) | AUC   | Precision | Recall | F-Measure 
KNN     | 72.3776 | 0.628 | 0.699     | 0.724  | 0.697 
LR      | 68.8811 | 0.646 | 0.668     | 0.689  | 0.675 
NB      | 71.6783 | 0.701 | 0.704     | 0.717  | 0.708 
SVM 


Table 5: Weighted Values for Vertebral Dataset 

Methods | Acc (%) | AUC   | Precision | Recall | F-Measure 
KNN       85.4167   0.854   0.853 
LR      | 92.5    | 0.930 | 0.919     | 0.925  | 0.920 
NB        0.854   0.886 
SVM     | 92.9167 | 0.788 | 0.924     | 0.929  | 0.925 


Table 6: Weighted Values for Yeast Dataset 

Methods | Acc (%) | AUC   | Precision | Recall | F-Measure 
KNN       0.524 
LR      | 58.6253 | 0.825 | 0.585     | 0.586  | 0.577 
NB      | 57.6146 | 0.816 | 0.585     | 0.576  | 0.566 
SVM       58.2884   0.785   0.583   0.602 


[Chart: Annealing Dataset Weighted Value; series: KNN, Logistic Regression, Naive Bayes, SVM] 

Fig.2. The chart is showing the effects of the Annealing dataset. 


Figs. 2-6 chart the classification performance on each dataset. In Fig. 2, LR has the 
highest accuracy on the Annealing dataset, followed by KNN, NB, and SVM. In Fig. 3, KNN 
provides better outcomes on the Breast Cancer dataset than LR, NB, and SVM. Likewise, in 
Fig. 4, NB yields more efficient outputs on the Hepatitis dataset than LR, KNN, and SVM, 
in that order. In Fig. 5, SVM has higher accuracy on the Vertebral dataset than LR, KNN, 
and NB. Lastly, in Fig. 6, LR outperforms SVM, NB, and KNN on the Yeast dataset. 


[Chart: Breast Cancer Dataset Weighted Value; axis: Acc (%), AUC, Precision, Recall, F-measure; series: KNN, Logistic Regression, Naive Bayes, SVM] 

Fig.3. The chart is showing the effects of the Breast Cancer dataset. 

[Chart: Hepatitis Dataset Weighted Value; axis: Acc (%), AUC, Precision, Recall, F-measure; series: KNN, Logistic Regression, Naive Bayes, SVM] 

Fig.4. The chart is showing the effects of the Hepatitis dataset. 

[Chart: Vertebral Dataset Weighted Value; axis: Acc (%), AUC, Precision, Recall, F-measure; series: KNN, Logistic Regression, Naive Bayes, SVM] 

Fig.5. The chart is showing the effects of the Vertebral dataset. 


[Chart: Yeast Dataset Weighted Value; axis: Acc (%), AUC, Precision, Recall, F-measure; series: KNN, Logistic Regression, Naive Bayes, SVM] 

Fig.6. The chart is showing the effects of the Yeast dataset. 


5. Conclusions And Future Work 


Based on the experimental and numerical 
results, the main findings of this research work 
can be summarized as follows: 

In this paper, we have examined the implementation of four ML algorithms, namely k-nearest 
neighbors (KNN), Logistic Regression (LR), Naive Bayes (NB), and Support Vector Machine 
(SVM), to classify multiple datasets. The efficiency of the algorithms is demonstrated in 
terms of precision, recall/sensitivity, accuracy, and F-score. Many ML algorithms are 
unable to provide satisfactory results because they are dependent on the datasets. The 
sensitivity of the same algorithm can be severely affected by varying the sizes of the 
training and test sets. 

Generally, LR yields better results than KNN, whereas in most datasets NB delivers more 
effective outputs than SVM. There is no outright winner in terms of performance; the 
outcome depends on the characteristics of the datasets, the simulation, and the 
circumstances. 

In the future, we plan to extend our study of classification models by introducing 
intelligent ML algorithms, which are more useful for an extensive collection of real-life 
datasets. 